Lab 2 Overview
First Half
More Visualisations:
Heatmaps
HexBins vs SquareBins
Choropleths
Descriptive Statistics:
Histograms and looking at the distribution shape.
How to “quick plot” dataframes using ggplot2.
Second Half
Revision:
- Any code related questions for R
Let’s get started
Visualisations
Binned plots
Good visualizations vs Bad visualizastions
Personal Checklist for Visualisations and Dashboards:
Your visualisation needs to tell a story.
It should be intepretable without being overly verbose.
The scale and axis need to make sense (and you can assume the reader knows the difference between a normal scale vs log scale).
The choice of visualisation needs to make sense:
- Line plot vs Bar chart with non-numerical categories
- Scatterplot vs Histogram plot to see distribution
- etc
- Choice of colour scheme / alpha / size need to be easy on the eyes.
At the end of the day, even if you think your visualisation is “pretty” or “beautiful”, if a reader cannot understand it, then it is not a good visualisation.
Load libraries
library(dplyr)
library(sf)
library(curl)
library(ggmap)
library(tmap)
library(tmaptools)
library(feather)Load data
#try feather
filepath = "/Volumes/you.y/MAST30034_R/data/df.feather"
df <- read_feather(filepath)
df %>% tail## # A tibble: 6 x 19
## VendorID tpep_pickup_dat… tpep_dropoff_da… passenger_count trip_distance
## <int> <fct> <fct> <int> <dbl>
## 1 2 4/12/15 22:55 4/12/15 23:28 2 12.8
## 2 2 4/12/15 22:55 4/12/15 23:03 1 0.75
## 3 1 4/12/15 22:55 4/12/15 23:08 1 2.4
## 4 1 4/12/15 22:55 4/12/15 23:01 1 0.8
## 5 2 4/12/15 22:55 4/12/15 23:17 1 4.73
## 6 2 4/12/15 22:55 4/12/15 22:59 2 0.8
## # … with 14 more variables: pickup_longitude <dbl>, pickup_latitude <dbl>,
## # RatecodeID <int>, store_and_fwd_flag <fct>, dropoff_longitude <dbl>,
## # dropoff_latitude <dbl>, payment_type <int>, fare_amount <dbl>, extra <dbl>,
## # mta_tax <dbl>, tip_amount <dbl>, tolls_amount <dbl>,
## # improvement_surcharge <dbl>, total_amount <dbl>
Heatmaps
- Excellent tool when visualising geospatial coordinates.
Select the map.
xranges <- range(df$pickup_longitude[!df$pickup_longitude==0])
yranges <- range(df$pickup_latitude[!df$pickup_latitude==0])
xranges## [1] -77.04710 -71.06483
yranges## [1] 37.27044 42.73614
map_big <- get_stamenmap(
rbind(-74.3,40.45,-73.75,40.95),
zoom = 10)
ggmap(map_big)Plot pickup locations
ggmap(map_big) +
geom_bin2d(data = df,
aes(x = pickup_longitude,
y = pickup_latitude),
bins=50) Change the color scale.
ggmap(map_big) +
stat_bin2d(data = df,
aes(x = pickup_longitude,
y = pickup_latitude),
bins=50) +
scale_fill_viridis_c(trans = "log10")ggmap(map_big) +
stat_bin2d(data = df,
aes(x = pickup_longitude,
y = pickup_latitude),
bins=50) +
scale_fill_gradientn(colours = c("darkred", "orange", "yellow", "white"),trans="log10")Filled by discrete factors.
ggmap(map_big) +
stat_bin2d(data = df,
aes(x = pickup_longitude,
y = pickup_latitude,
fill = as.factor(passenger_count)),
bins=50) Change the color palette. A good palette makes your story better.
library(RColorBrewer)
display.brewer.all()ggmap(map_big) +
stat_bin2d(data = df,
aes(x = pickup_longitude,
y = pickup_latitude,
fill = as.factor(payment_type)),
bins=100) +
scale_fill_brewer(palette = "Dark2")ggmap(map_big) +
stat_bin2d(data = df,
aes(x = pickup_longitude,
y = pickup_latitude,
fill = as.factor(payment_type),
alpha = 0.95),
bins=100) +
scale_fill_brewer(palette = "Dark2")Hexagon bins avoid the visual artefacts sometimes generated by the very regular alignment of geom_bin2d().
Plot HexBin.
ggmap(map_big) +
stat_binhex(data=df,
aes(x = pickup_longitude, y = pickup_latitude),
bins=100) +
coord_cartesian() +
scale_fill_gradient2(trans="log10")ggmap(map_big) +
stat_binhex(data=df,
aes(x = pickup_longitude, y = pickup_latitude),
alpha = 0.9,
bins=100) +
coord_cartesian() +
scale_fill_viridis_c(trans = "log10") +
facet_grid(.~payment_type)Shapefile Overlays
- NOTE: This only applies on the more recent datasets that use zones over coordinates
Requirements:
- ggplot2/ ggmap/ tmap/ leaflet
Shapefile Links:
taxi_zones.zip taxi+_zone_lookup.csv
df_yellow = read.csv("/Volumes/you.y/MAST30034_R/data/yellow_tripdata_2019-01.csv",stringsAsFactors = TRUE)
df_yellow %>%
group_by(PULocationID, passenger_count) %>%
summarise(n = sum(passenger_count)) -> df_yellow_summary## `summarise()` regrouping output by 'PULocationID' (override with `.groups` argument)
library(sf)sf = st_read("/Volumes/you.y/MAST30034_R/data/shapefile/taxi_zones/taxi_zones.shp")## Reading layer `taxi_zones' from data source
## `/Volumes/you.y/MAST30034_R/data/shapefile/taxi_zones/taxi_zones.shp'
## using driver `ESRI Shapefile'
## Simple feature collection with 263 features and 6 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: 913175.1 ymin: 120121.9 xmax: 1067383 ymax: 272844.3
## Projected CRS: NAD83 / New York Long Island (ftUS)
zone = read.csv("/Volumes/you.y/MAST30034_R/data/shapefile/taxi+_zone_lookup.csv")Basic information in shapefile.
sf = st_read("/Volumes/you.y/MAST30034_R/data/shapefile/taxi_zones/")## Reading layer `taxi_zones' from data source
## `/Volumes/you.y/MAST30034_R/data/shapefile/taxi_zones' using driver `ESRI Shapefile'
## Simple feature collection with 263 features and 6 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: 913175.1 ymin: 120121.9 xmax: 1067383 ymax: 272844.3
## Projected CRS: NAD83 / New York Long Island (ftUS)
plot(sf)Choropleth mapping with ggplot2
ggplot() +
geom_sf(data = sf, size = 1, color = "black", fill = "cyan1") +
ggtitle("Boundary Plot") +
coord_sf()+
theme_bw()Adding basemaps with ggmap
ggmap(map_big) +
geom_sf(data = sf, aes(fill=Shape_Area), inherit.aes = FALSE) +
coord_sf(crs = st_crs(4326)) +
scale_fill_viridis_c()Choropleth with tmap
tmap is specifically designed to make creation of thematic maps more convenient. It borrows from thw ggplot syntax and takes care of a lot of the styling and aesthetics. This reduces our amount of code significantly. We only need:
tm_shape() where we provide the sf object (we could also provide an SpatialPolygonsDataframe)
tm_polygons() where we set the attribute variable to map, the break style, and a title.
tm_shape(sf)+
tm_polygons("Shape_Area",style="quantile")Web mapping with leaflet
library(leaflet) Incorporate the information we have.
df_yellow_summary$LocationID <- df_yellow_summary$PULocationID
full_join(sf,df_yellow_summary,by="LocationID") -> sf_fullWhile tmap was tolerant about our AEA projection of philly_crimes_sf, leaflet does require us to explicitly reproject the sf object.
# reproject
sf_WGS84 <- st_transform(sf_full, 4326)
leaflet(sf_WGS84) %>%
addPolygons()sf_WGS84 %>% filter(!is.na(n)) ->sf_plot
pal_fun <- colorQuantile("YlOrRd", NULL, n = 5)
p_popup <- paste0("<strong>Passanger number: </strong>", sf_plot$n)
leaflet(sf_plot) %>%
addPolygons(
stroke = FALSE,
fillColor = ~pal_fun(n),
fillOpacity = 0.8, smoothFactor = 0.5,
popup = p_popup) %>%
addTiles() %>%
addLegend("bottomright", # location
pal=pal_fun, # palette function
values=~n, # value to be passed to palette function
title = 'Passanger number') # legend titleDescriptive Statistics
filepath = "/Volumes/you.y/MAST30034_R/data/df.feather"
df <- read_feather(filepath)summary(df)## VendorID tpep_pickup_datetime tpep_dropoff_datetime
## Min. :1.000 4/12/15 19:14: 508 4/12/15 21:00: 487
## 1st Qu.:1.000 4/12/15 22:44: 506 4/12/15 21:40: 486
## Median :2.000 4/12/15 21:57: 503 4/12/15 20:14: 484
## Mean :1.528 4/12/15 22:46: 491 4/12/15 20:58: 480
## 3rd Qu.:2.000 4/12/15 19:39: 489 4/12/15 21:01: 480
## Max. :2.000 4/12/15 19:28: 486 4/12/15 20:31: 476
## (Other) :97017 (Other) :97107
## passenger_count trip_distance pickup_longitude pickup_latitude
## Min. :0.000 Min. : 0.000 Min. :-77.05 Min. : 0.00
## 1st Qu.:1.000 1st Qu.: 1.030 1st Qu.:-73.99 1st Qu.:40.73
## Median :1.000 Median : 1.710 Median :-73.98 Median :40.75
## Mean :1.704 Mean : 2.827 Mean :-72.88 Mean :40.14
## 3rd Qu.:2.000 3rd Qu.: 3.100 3rd Qu.:-73.97 3rd Qu.:40.77
## Max. :6.000 Max. :91.200 Max. : 0.00 Max. :42.74
##
## RatecodeID store_and_fwd_flag dropoff_longitude dropoff_latitude
## Min. : 1.000 N:99421 Min. :-74.62 Min. : 0.00
## 1st Qu.: 1.000 Y: 579 1st Qu.:-73.99 1st Qu.:40.73
## Median : 1.000 Median :-73.98 Median :40.75
## Mean : 1.029 Mean :-72.93 Mean :40.17
## 3rd Qu.: 1.000 3rd Qu.:-73.96 3rd Qu.:40.77
## Max. :99.000 Max. : 0.00 Max. :41.49
##
## payment_type fare_amount extra mta_tax
## Min. :1.00 Min. :-120.00 Min. :-1.0000 Min. :-0.5000
## 1st Qu.:1.00 1st Qu.: 7.00 1st Qu.: 0.5000 1st Qu.: 0.5000
## Median :1.00 Median : 10.00 Median : 0.5000 Median : 0.5000
## Mean :1.34 Mean : 12.75 Mean : 0.6049 Mean : 0.4985
## 3rd Qu.:2.00 3rd Qu.: 15.00 3rd Qu.: 0.5000 3rd Qu.: 0.5000
## Max. :4.00 Max. : 500.00 Max. : 1.5000 Max. : 2.5000
##
## tip_amount tolls_amount improvement_surcharge total_amount
## Min. : 0.00 Min. : 0.0000 Min. :-0.3000 Min. :-120.30
## 1st Qu.: 0.00 1st Qu.: 0.0000 1st Qu.: 0.3000 1st Qu.: 9.30
## Median : 1.46 Median : 0.0000 Median : 0.3000 Median : 12.80
## Mean : 1.80 Mean : 0.2386 Mean : 0.2997 Mean : 16.19
## 3rd Qu.: 2.55 3rd Qu.: 0.0000 3rd Qu.: 0.3000 3rd Qu.: 18.36
## Max. :115.00 Max. :24.0000 Max. : 0.3000 Max. : 550.30
##
Remember that not all the data should be interpreted as purely numerical. There may be conclusions you can draw by coincidence if you incorrectly assume data types! For example:
longitude and latitude should be interpreted as geospatial coordinates
payment_type is a discrete category of payment types
trip_distance is non-linear (not a straight line from A to B), but we have no further data on it. It is also in miles.
To avoid misinterpreting the attributes, refer to the data dictionary provided on the TLC website.
Scatterplot plots for Fare / Distance
ggplot(df) +
aes(x=fare_amount,y=trip_distance) +
geom_point()+
theme_bw()General Inference:
- We can visually see that the relationship is relatively linear as you’d expect (more distance generally means more money)
- A fair number of outliers, notably around the 0 distance axis and 0 cost axis. Why might this be the case?
- Why are there negative values?
Histgram plots for Fare amount
ggplot(df)+
geom_histogram(aes(x=fare_amount),bins =30, fill="darkblue") +
theme_bw()- I guess we can kind of see most fares are between 0 - 100.
- Hard to tell where the main distribution is spread around.
- We can enhance this using a log transformation.
df$transform_fare_amount <- log(df$fare_amount)
ggplot(df)+
geom_histogram(aes(x=transform_fare_amount),bins = 30, fill="darkblue") +
theme_bw()- Take a log transformation to visually see the distribution
Do a correlation check.
cor(df$trip_distance,df$fare_amount,method = "pearson")## [1] 0.899978
library(pheatmap)
#remove characters
pheatmap(cor(df[,-c(1:3,9)]),
cluster_rows = FALSE,
cluster_cols = FALSE)- trip_distance highly correlates with high tips, tolls and overall trip amount
- payment_type seems to have some form of negative correlation with tip_amount. Must be careful as this is a discrete category.
cors =c("passenger_count", "trip_distance", "fare_amount", "extra",
"mta_tax", "tip_amount", "tolls_amount", "improvement_surcharge", "total_amount")
cor(df[,cors])## passenger_count trip_distance fare_amount extra
## passenger_count 1.0000000000 0.001658362 0.0009505592 -0.003093358
## trip_distance 0.0016583622 1.000000000 0.8999780236 -0.223323624
## fare_amount 0.0009505592 0.899978024 1.0000000000 -0.207022734
## extra -0.0030933583 -0.223323624 -0.2070227338 1.000000000
## mta_tax 0.0131316387 -0.027817098 -0.1282543385 0.130305153
## tip_amount -0.0111149872 0.530623359 0.5573861481 -0.100025231
## tolls_amount 0.0065090920 0.572059235 0.5307124782 -0.157824284
## improvement_surcharge 0.0040463644 0.015943787 0.0489414626 0.095486378
## total_amount -0.0007104667 0.898539215 0.9824415224 -0.186521180
## mta_tax tip_amount tolls_amount
## passenger_count 0.01313164 -0.01111499 0.006509092
## trip_distance -0.02781710 0.53062336 0.572059235
## fare_amount -0.12825434 0.55738615 0.530712478
## extra 0.13030515 -0.10002523 -0.157824284
## mta_tax 1.00000000 -0.06226707 -0.094005581
## tip_amount -0.06226707 1.00000000 0.359383270
## tolls_amount -0.09400558 0.35938327 1.000000000
## improvement_surcharge 0.59152689 0.01722265 0.004483847
## total_amount -0.12168476 0.68568923 0.603590674
## improvement_surcharge total_amount
## passenger_count 0.004046364 -0.0007104667
## trip_distance 0.015943787 0.8985392152
## fare_amount 0.048941463 0.9824415224
## extra 0.095486378 -0.1865211796
## mta_tax 0.591526892 -0.1216847619
## tip_amount 0.017222652 0.6856892276
## tolls_amount 0.004483847 0.6035906745
## improvement_surcharge 1.000000000 0.0486015276
## total_amount 0.048601528 1.0000000000
pheatmap(cor(df[,cors]))Other plots to present correlations.
library(corrplot)## corrplot 0.90 loaded
corrplot(cor(df[,cors]), method = 'number')corrplot(cor(df[,cors]), method = 'color', order = 'alphabet') corrplot(cor(df[,cors]), method = 'color', order = 'AOE') # after 'AOE' reorderStudy Payment type and Distance
library(forcats)
df = df %>%
mutate(payment_type = as_factor(payment_type)) %>%
mutate(payment_type = recode(payment_type,`1` = "Credit Card", `2` = "Cash", `3` = "No charge", `4` = "Dispute"))# Check category frequencies
fct_count(df$payment_type)## # A tibble: 4 x 2
## f n
## <fct> <int>
## 1 Credit Card 66601
## 2 Cash 32955
## 3 No charge 336
## 4 Dispute 108
#Collapse Cash/No charge/Dispute
df = df %>%
mutate(payment_type = fct_collapse(payment_type,
Credit_Card = c("Credit Card"),
Cash_Or_Others=c("Cash","No charge","Dispute")))ggplot(df, aes(payment_type)) +
geom_bar(fill=c("pink","orange")) +
ylab("Frequency") +
ggtitle("Barplot of Payment Methods")+
theme_bw()ggplot(df) +
geom_histogram(aes(x=trip_distance, fill=payment_type))+
theme_bw()ggplot(df) +
geom_histogram(aes(x=trip_distance, fill=payment_type),
breaks=seq(0,20, by=1))+
theme_bw()+
scale_fill_brewer(palette = "Paired") Use facets.
ggplot(df) +
geom_histogram(aes(x=trip_distance, fill=payment_type),
breaks=seq(0,20, by=1))+
theme_bw()+
facet_grid(.~payment_type) +
scale_fill_brewer(palette = "Paired")Use Spark in R
See sparklyr tutorial.
SessionInfo
sessionInfo()## R version 4.0.5 (2021-03-31)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Mojave 10.14.6
##
## Matrix products: default
## BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] forcats_0.5.1 corrplot_0.90 pheatmap_1.0.12 leaflet_2.0.4.1
## [5] RColorBrewer_1.1-2 feather_0.3.5 tmaptools_3.1-1 tmap_3.3-2
## [9] ggmap_3.0.0 ggplot2_3.3.2 curl_4.3.1 sf_1.0-0
## [13] dplyr_1.0.2
##
## loaded via a namespace (and not attached):
## [1] httr_1.4.2 tidyr_1.1.2 jsonlite_1.7.1
## [4] viridisLite_0.3.0 assertthat_0.2.1 sp_1.4-5
## [7] yaml_2.2.1 pillar_1.4.6 lattice_0.20-41
## [10] glue_1.4.2 digest_0.6.27 leaflet.providers_1.9.0
## [13] colorspace_2.0-0 htmltools_0.5.1.1 plyr_1.8.6
## [16] XML_3.99-0.5 pkgconfig_2.0.3 raster_3.4-10
## [19] stars_0.5-3 s2_1.0.5 bookdown_0.22
## [22] purrr_0.3.4 scales_1.1.1 jpeg_0.1-8.1
## [25] tibble_3.0.4 proxy_0.4-25 generics_0.1.0
## [28] farver_2.0.3 ellipsis_0.3.1 withr_2.3.0
## [31] leafsync_0.1.0 hexbin_1.28.2 cli_2.1.0
## [34] magrittr_1.5 crayon_1.3.4 evaluate_0.14
## [37] fansi_0.4.1 lwgeom_0.2-6 class_7.3-18
## [40] tools_4.0.5 hms_0.5.3 RgoogleMaps_1.4.5.3
## [43] lifecycle_0.2.0 stringr_1.4.0 munsell_0.5.0
## [46] compiler_4.0.5 e1071_1.7-7 rlang_0.4.8
## [49] classInt_0.4-3 units_0.7-1 grid_4.0.5
## [52] dichromat_2.0-0 rjson_0.2.20 htmlwidgets_1.5.2
## [55] crosstalk_1.1.0.1 leafem_0.1.6 bitops_1.0-6
## [58] base64enc_0.1-3 labeling_0.4.2 rmarkdown_2.6
## [61] wk_0.4.1 gtable_0.3.0 codetools_0.2-18
## [64] abind_1.4-5 DBI_1.1.0 R6_2.5.0
## [67] knitr_1.30 utf8_1.1.4 KernSmooth_2.23-18
## [70] stringi_1.5.3 parallel_4.0.5 rmdformats_1.0.3
## [73] Rcpp_1.0.5 vctrs_0.3.4 png_0.1-7
## [76] tidyselect_1.1.0 xfun_0.24